Project 1: Predicting the Location of Mass Shooting Events¶

In this project we use a variety of classification models to attempt to predict the location of mass shooting events, based on each shooter's mental health history and signs of being in crisis in the six months prior to the shooting.

Libraries¶

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import openpyxl
import sys
import seaborn as sns
import plotly.express as px # graphing interactive map from data
sys.setrecursionlimit(10000000)

# Render our plots inline
%matplotlib inline
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 7)

Import the Data¶

In [2]:
mass_shootings = pd.read_excel('Violence-Project-Mass-Shooter-Database-Version-5-May-2022.xlsx', sheet_name='Full Database', header=1)
mass_shootings
Out[2]:
Case # Shooter Last Name Shooter First Name Full Date Day of Week Day Month Year Shooting Location Address City ... Performance Interest in Firearms Firearm Proficiency Total Firearms Brought to the Scene Other Weapons or Gear Specify Other Weapons or Gear On-Scene Outcome Attempt to Flee Insanity Defense Criminal Sentence
0 1 Whitman Charles 1966-08-01 Monday 1 8 1966 110 Inner Campus Drive, Austin, TX 78705 Austin ... 0.0 1.0 3.0 7.0 1.0 hatchet, hammer, knives, wrench, ropes, water,... 1.0 0.0 2.0 0.0
1 2 Smith Robert 1966-11-12 Saturday 12 11 1966 Rose-Mar College of Beauty in Mesa, AZ Mesa ... 1.0 0.0 1.0 1.0 1.0 knife, nylon cord 2.0 0.0 1.0 1.0
2 3 Held Leo 1967-10-23 Monday 23 10 1967 599 South Highland Street Lockhaven, PA 17745 Lock Haven ... 0.0 1.0 3.0 2.0 1.0 holster 1.0 0.0 2.0 0.0
3 4 Pearson Eric 1968-03-16 Saturday 16 3 1968 11703 Lake Rd, Ironwood, MI 49938 Ironwood ... 0.0 0.0 0.0 1.0 0.0 NaN 2.0 0.0 0.0 3.0
4 5 Lambright Donald 1969-04-05 Saturday 5 4 1969 Pennsylvania Turnpike near Harrisburg, PA Harrisburg ... 0.0 0.0 3.0 2.0 0.0 NaN 0.0 0.0 2.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
176 178 Gaxiola Gonzalez Aminadab 2021-03-31 Wednesday 31 3 2021 202 West Lincoln Avenue Orange, CA 92865 Orange ... 0.0 0.0 0.0 1.0 1.0 pepper spray, handcuffs, ammunition, locked ex... 2.0 0.0 3.0 NaN
177 179 Hole Brandon Scott 2021-04-15 Thursday 15 4 2021 8951 Mirabel Rd, Indianapolis, IN 46241 Indianapolis ... 0.0 0.0 0.0 2.0 0.0 NaN 0.0 0.0 2.0 NaN
178 180 Cassidy Samuel 2021-05-26 Wednesday 26 5 2021 101 W Younger Ave, San Jose, CA 95110 San Jose ... 0.0 1.0 0.0 3.0 1.0 32 extended magazines 0.0 0.0 2.0 NaN
179 181 Crumbley Ethan 2021-11-30 Tuesday 30 11 2021 745 N Oxford Rd, Oxford, MI 48371 Oxford ... 0.0 1.0 1.0 1.0 0.0 NaN 2.0 0.0 3.0 NaN
180 182 Gendron Payton 2022-05-14 Saturday 14 5 2022 1275 Jefferson Ave, Buffalo, NY 14208 Buffalo ... 1.0 1.0 3.0 1.0 1.0 tactical gear, bulletproof vest, helmet 2.0 0.0 NaN NaN

181 rows × 142 columns

Select the Variables of Interest¶

In [3]:
my_data = mass_shootings[['Age', 'Gender', 'Race', 'Education','Location', 'City', 'State', 'Region', 'Suicidality', 
                        'Voluntary or Involuntary Hospitalization','Prior Hospitalization', 'Prior Counseling', 
                        'Voluntary or Mandatory Counseling', 'Recent or Ongoing Stressor', 'Signs of Being in Crisis',
                        'Timeline of Signs of Crisis', 'Leakage ', 'Leakage How', 'Leakage Who ', 'Number Killed', 'Number Injured']]
my_data
Out[3]:
Age Gender Race Education Location City State Region Suicidality Voluntary or Involuntary Hospitalization ... Prior Counseling Voluntary or Mandatory Counseling Recent or Ongoing Stressor Signs of Being in Crisis Timeline of Signs of Crisis Leakage Leakage How Leakage Who Number Killed Number Injured
0 25.0 0.0 0.0 2.0 1 Austin TX 0 2.0 0.0 ... 1.0 1 4 1.0 2.0 1.0 0 0 15 31
1 18.0 0.0 0.0 0.0 4 Mesa AZ 3 1.0 0.0 ... 0.0 0 0 1.0 3.0 0.0 NaN NaN 5 2
2 39.0 0.0 0.0 2.0 9 Lock Haven PA 2 2.0 0.0 ... 0.0 0 2 1.0 2.0 0.0 NaN NaN 6 6
3 56.0 0.0 0.0 NaN 5 Ironwood MI 0 0.0 0.0 ... 0.0 0 1 0.0 NaN 0.0 NaN NaN 7 2
4 31.0 0.0 1.0 2.0 8 Harrisburg PA 2 1.0 0.0 ... 0.0 0 2 1.0 0.0 0.0 NaN NaN 4 17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
176 44.0 0.0 2.0 NaN 6 Orange CA 3 0.0 0.0 ... 0.0 0 0 0.0 NaN 0.0 NaN NaN 4 1
177 19.0 0.0 0.0 0.0 9 Indianapolis IN 1 1.0 0.0 ... 1.0 1 2, 4 1.0 3.0 0.0 NaN NaN 8 7
178 57.0 0.0 0.0 2.0 9 San Jose CA 3 1.0 0.0 ... 0.0 0 2 1.0 3.0 1.0 0, 2 2, 9 9 0
179 15.0 0.0 0.0 0.0 0 Oxford MI 1 0.0 0.0 ... 0.0 0 0 1.0 2.0 1.0 5, 3, 4 7, 7, 9 4 7
180 18.0 0.0 0.0 2.0 4 Buffalo NY 2 1.0 2.0 ... 1.0 2 NaN NaN NaN 1.0 2022-04-04 00:00:00 9, 6, 7 10 3

181 rows × 21 columns

Clean the Data¶

Use fillna() to replace nulls with -1 so that all rows are retained. As a workaround, models that rely on variables with many nulls will include code to exclude rows with -1 values, so the placeholder values do not skew those models.
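A minimal sketch of that workaround, using a hypothetical mini-frame in place of my_data (the column names mirror the real dataset; the values are made up):

```python
import pandas as pd

# Hypothetical mini-frame standing in for my_data after fillna('-1');
# the sentinel is a string because fillna is called with value='-1'.
df = pd.DataFrame({
    'Education': [2.0, '-1', 0.0, '-1'],
    'Location':  [1, 4, 9, 5],
})

# For a model that uses Education, drop the sentinel rows first so the
# placeholder values do not skew the fit.
usable = df[df['Education'] != '-1']
```

The same filter, applied per feature, lets each model use the largest subset of rows that is complete for its own variables.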

In [4]:
my_data = my_data.fillna(value='-1')  # string sentinel so later str.contains() checks work on mixed columns
my_data
Out[4]:
Age Gender Race Education Location City State Region Suicidality Voluntary or Involuntary Hospitalization ... Prior Counseling Voluntary or Mandatory Counseling Recent or Ongoing Stressor Signs of Being in Crisis Timeline of Signs of Crisis Leakage Leakage How Leakage Who Number Killed Number Injured
0 25.0 0.0 0.0 2.0 1 Austin TX 0 2.0 0.0 ... 1.0 1 4 1.0 2.0 1.0 0 0 15 31
1 18.0 0.0 0.0 0.0 4 Mesa AZ 3 1.0 0.0 ... 0.0 0 0 1.0 3.0 0.0 -1 -1 5 2
2 39.0 0.0 0.0 2.0 9 Lock Haven PA 2 2.0 0.0 ... 0.0 0 2 1.0 2.0 0.0 -1 -1 6 6
3 56.0 0.0 0.0 -1 5 Ironwood MI 0 0.0 0.0 ... 0.0 0 1 0.0 -1 0.0 -1 -1 7 2
4 31.0 0.0 1.0 2.0 8 Harrisburg PA 2 1.0 0.0 ... 0.0 0 2 1.0 0.0 0.0 -1 -1 4 17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
176 44.0 0.0 2.0 -1 6 Orange CA 3 0.0 0.0 ... 0.0 0 0 0.0 -1 0.0 -1 -1 4 1
177 19.0 0.0 0.0 0.0 9 Indianapolis IN 1 1.0 0.0 ... 1.0 1 2, 4 1.0 3.0 0.0 -1 -1 8 7
178 57.0 0.0 0.0 2.0 9 San Jose CA 3 1.0 0.0 ... 0.0 0 2 1.0 3.0 1.0 0, 2 2, 9 9 0
179 15.0 0.0 0.0 0.0 0 Oxford MI 1 0.0 0.0 ... 0.0 0 0 1.0 2.0 1.0 5, 3, 4 7, 7, 9 4 7
180 18.0 0.0 0.0 2.0 4 Buffalo NY 2 1.0 2.0 ... 1.0 2 -1 -1 -1 1.0 2022-04-04 00:00:00 9, 6, 7 10 3

181 rows × 21 columns

In [5]:
my_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181 entries, 0 to 180
Data columns (total 21 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   Age                                       181 non-null    object
 1   Gender                                    181 non-null    object
 2   Race                                      181 non-null    object
 3   Education                                 181 non-null    object
 4   Location                                  181 non-null    int64 
 5   City                                      181 non-null    object
 6   State                                     181 non-null    object
 7   Region                                    181 non-null    int64 
 8   Suicidality                               181 non-null    object
 9   Voluntary or Involuntary Hospitalization  181 non-null    object
 10  Prior Hospitalization                     181 non-null    object
 11  Prior Counseling                          181 non-null    object
 12  Voluntary or Mandatory Counseling         181 non-null    object
 13  Recent or Ongoing Stressor                181 non-null    object
 14  Signs of Being in Crisis                  181 non-null    object
 15  Timeline of Signs of Crisis               181 non-null    object
 16  Leakage                                   181 non-null    object
 17  Leakage How                               181 non-null    object
 18  Leakage Who                               181 non-null    object
 19  Number Killed                             181 non-null    int64 
 20  Number Injured                            181 non-null    int64 
dtypes: int64(4), object(17)
memory usage: 29.8+ KB

Re-Code Columns¶

We can see that no null or N/A values remain. However, some rows contain multiple responses for certain variables. We need to re-code these to indicate that multiple responses were given:

  • Recent or Ongoing Stressors = 7
  • Voluntary or Mandatory Counseling = 3
  • Leakage How = 6
  • Leakage Who = 10
In [6]:
my_data.loc[my_data['Recent or Ongoing Stressor'].str.contains(', ', na=False), 'Recent or Ongoing Stressor'] = 7
my_data.loc[my_data['Voluntary or Mandatory Counseling'].str.contains(', ', na=False), 'Voluntary or Mandatory Counseling'] = 3
my_data.loc[my_data['Leakage How'].str.contains(', ', na=False), 'Leakage How'] = 6
my_data.loc[my_data['Leakage Who '].str.contains(', ', na=False), 'Leakage Who '] = 10
In [7]:
my_data
Out[7]:
Age Gender Race Education Location City State Region Suicidality Voluntary or Involuntary Hospitalization ... Prior Counseling Voluntary or Mandatory Counseling Recent or Ongoing Stressor Signs of Being in Crisis Timeline of Signs of Crisis Leakage Leakage How Leakage Who Number Killed Number Injured
0 25.0 0.0 0.0 2.0 1 Austin TX 0 2.0 0.0 ... 1.0 1 4 1.0 2.0 1.0 0 0 15 31
1 18.0 0.0 0.0 0.0 4 Mesa AZ 3 1.0 0.0 ... 0.0 0 0 1.0 3.0 0.0 -1 -1 5 2
2 39.0 0.0 0.0 2.0 9 Lock Haven PA 2 2.0 0.0 ... 0.0 0 2 1.0 2.0 0.0 -1 -1 6 6
3 56.0 0.0 0.0 -1 5 Ironwood MI 0 0.0 0.0 ... 0.0 0 1 0.0 -1 0.0 -1 -1 7 2
4 31.0 0.0 1.0 2.0 8 Harrisburg PA 2 1.0 0.0 ... 0.0 0 2 1.0 0.0 0.0 -1 -1 4 17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
176 44.0 0.0 2.0 -1 6 Orange CA 3 0.0 0.0 ... 0.0 0 0 0.0 -1 0.0 -1 -1 4 1
177 19.0 0.0 0.0 0.0 9 Indianapolis IN 1 1.0 0.0 ... 1.0 1 7 1.0 3.0 0.0 -1 -1 8 7
178 57.0 0.0 0.0 2.0 9 San Jose CA 3 1.0 0.0 ... 0.0 0 2 1.0 3.0 1.0 6 10 9 0
179 15.0 0.0 0.0 0.0 0 Oxford MI 1 0.0 0.0 ... 0.0 0 0 1.0 2.0 1.0 6 10 4 7
180 18.0 0.0 0.0 2.0 4 Buffalo NY 2 1.0 2.0 ... 1.0 2 -1 -1 -1 1.0 2022-04-04 00:00:00 10 10 3

181 rows × 21 columns

Let's also re-code the Location column so that K-12 School, originally coded 0, becomes 11.

In [8]:
my_data['Location'] = my_data['Location'].replace([0],11)

Remove Outliers¶

Row 144 has a lot of missing data, and row 152 (the Las Vegas shooting) is an extreme outlier. Let's drop both, along with row 180, which contains mismatched data types.

In [9]:
# deleting that row and the vegas outlier
my_data.drop([144], axis=0, inplace = True)
my_data.drop([152], axis=0, inplace = True)
# We can also drop row 180 due to mis-matched data types
my_data.drop([180], axis=0, inplace = True)
my_data.reset_index()
Out[9]:
index Age Gender Race Education Location City State Region Suicidality ... Prior Counseling Voluntary or Mandatory Counseling Recent or Ongoing Stressor Signs of Being in Crisis Timeline of Signs of Crisis Leakage Leakage How Leakage Who Number Killed Number Injured
0 0 25.0 0.0 0.0 2.0 1 Austin TX 0 2.0 ... 1.0 1 4 1.0 2.0 1.0 0 0 15 31
1 1 18.0 0.0 0.0 0.0 4 Mesa AZ 3 1.0 ... 0.0 0 0 1.0 3.0 0.0 -1 -1 5 2
2 2 39.0 0.0 0.0 2.0 9 Lock Haven PA 2 2.0 ... 0.0 0 2 1.0 2.0 0.0 -1 -1 6 6
3 3 56.0 0.0 0.0 -1 5 Ironwood MI 0 0.0 ... 0.0 0 1 0.0 -1 0.0 -1 -1 7 2
4 4 31.0 0.0 1.0 2.0 8 Harrisburg PA 2 1.0 ... 0.0 0 2 1.0 0.0 0.0 -1 -1 4 17
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 175 21.0 0.0 4.0 1.0 4 Boulder CO 3 0.0 ... 0.0 0 0 1.0 3.0 0.0 -1 -1 10 1
174 176 44.0 0.0 2.0 -1 6 Orange CA 3 0.0 ... 0.0 0 0 0.0 -1 0.0 -1 -1 4 1
175 177 19.0 0.0 0.0 0.0 9 Indianapolis IN 1 1.0 ... 1.0 1 7 1.0 3.0 0.0 -1 -1 8 7
176 178 57.0 0.0 0.0 2.0 9 San Jose CA 3 1.0 ... 0.0 0 2 1.0 3.0 1.0 6 10 9 0
177 179 15.0 0.0 0.0 0.0 11 Oxford MI 1 0.0 ... 0.0 0 0 1.0 2.0 1.0 6 10 4 7

178 rows × 22 columns

In [10]:
my_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 21 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   Age                                       178 non-null    object
 1   Gender                                    178 non-null    object
 2   Race                                      178 non-null    object
 3   Education                                 178 non-null    object
 4   Location                                  178 non-null    int64 
 5   City                                      178 non-null    object
 6   State                                     178 non-null    object
 7   Region                                    178 non-null    int64 
 8   Suicidality                               178 non-null    object
 9   Voluntary or Involuntary Hospitalization  178 non-null    object
 10  Prior Hospitalization                     178 non-null    object
 11  Prior Counseling                          178 non-null    object
 12  Voluntary or Mandatory Counseling         178 non-null    object
 13  Recent or Ongoing Stressor                178 non-null    object
 14  Signs of Being in Crisis                  178 non-null    object
 15  Timeline of Signs of Crisis               178 non-null    object
 16  Leakage                                   178 non-null    object
 17  Leakage How                               178 non-null    object
 18  Leakage Who                               178 non-null    object
 19  Number Killed                             178 non-null    int64 
 20  Number Injured                            178 non-null    int64 
dtypes: int64(4), object(17)
memory usage: 30.6+ KB

Convert data types to integers.

In [11]:
my_data['Age'] = my_data['Age'].astype('int64')
In [12]:
my_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 21 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   Age                                       178 non-null    int64 
 1   Gender                                    178 non-null    object
 2   Race                                      178 non-null    object
 3   Education                                 178 non-null    object
 4   Location                                  178 non-null    int64 
 5   City                                      178 non-null    object
 6   State                                     178 non-null    object
 7   Region                                    178 non-null    int64 
 8   Suicidality                               178 non-null    object
 9   Voluntary or Involuntary Hospitalization  178 non-null    object
 10  Prior Hospitalization                     178 non-null    object
 11  Prior Counseling                          178 non-null    object
 12  Voluntary or Mandatory Counseling         178 non-null    object
 13  Recent or Ongoing Stressor                178 non-null    object
 14  Signs of Being in Crisis                  178 non-null    object
 15  Timeline of Signs of Crisis               178 non-null    object
 16  Leakage                                   178 non-null    object
 17  Leakage How                               178 non-null    object
 18  Leakage Who                               178 non-null    object
 19  Number Killed                             178 non-null    int64 
 20  Number Injured                            178 non-null    int64 
dtypes: int64(5), object(16)
memory usage: 30.6+ KB

Let's add a column to calculate the total casualties.

In [13]:
# sum deaths and injuries into a new column called casualties
my_data["Casualties"] = my_data["Number Killed"] + my_data["Number Injured"]
my_data.head()
Out[13]:
Age Gender Race Education Location City State Region Suicidality Voluntary or Involuntary Hospitalization ... Voluntary or Mandatory Counseling Recent or Ongoing Stressor Signs of Being in Crisis Timeline of Signs of Crisis Leakage Leakage How Leakage Who Number Killed Number Injured Casualties
0 25 0.0 0.0 2.0 1 Austin TX 0 2.0 0.0 ... 1 4 1.0 2.0 1.0 0 0 15 31 46
1 18 0.0 0.0 0.0 4 Mesa AZ 3 1.0 0.0 ... 0 0 1.0 3.0 0.0 -1 -1 5 2 7
2 39 0.0 0.0 2.0 9 Lock Haven PA 2 2.0 0.0 ... 0 2 1.0 2.0 0.0 -1 -1 6 6 12
3 56 0.0 0.0 -1 5 Ironwood MI 0 0.0 0.0 ... 0 1 0.0 -1 0.0 -1 -1 7 2 9
4 31 0.0 1.0 2.0 8 Harrisburg PA 2 1.0 0.0 ... 0 2 1.0 0.0 0.0 -1 -1 4 17 21

5 rows × 22 columns

In [14]:
# checking that data is cleaned better
my_data.isna().sum()
Out[14]:
Age                                         0
Gender                                      0
Race                                        0
Education                                   0
Location                                    0
City                                        0
State                                       0
Region                                      0
Suicidality                                 0
Voluntary or Involuntary Hospitalization    0
Prior Hospitalization                       0
Prior Counseling                            0
Voluntary or Mandatory Counseling           0
Recent or Ongoing Stressor                  0
Signs of Being in Crisis                    0
Timeline of Signs of Crisis                 0
Leakage                                     0
Leakage How                                 0
Leakage Who                                 0
Number Killed                               0
Number Injured                              0
Casualties                                  0
dtype: int64

All N/A's have been removed and the data is sufficiently clean.

Exploratory Data Analysis¶

In [15]:
my_data_hm = my_data
my_data_hm = my_data_hm.drop(["State", 'City'], axis=1)
my_data_hm = my_data_hm.astype('int64')
In [16]:
my_data_hm.describe()
Out[16]:
Age Gender Race Education Location Region Suicidality Voluntary or Involuntary Hospitalization Prior Hospitalization Prior Counseling Voluntary or Mandatory Counseling Recent or Ongoing Stressor Signs of Being in Crisis Timeline of Signs of Crisis Leakage Leakage How Leakage Who Number Killed Number Injured Casualties
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 33.741573 0.022472 0.848315 0.904494 5.983146 1.438202 1.117978 0.370787 0.191011 0.292135 0.415730 3.061798 0.825843 1.432584 0.443820 0.101124 1.820225 6.910112 6.477528 13.387640
std 12.180403 0.148631 1.408017 1.524520 2.750397 1.284018 0.818302 0.764790 0.394207 0.456027 0.733523 2.774297 0.422537 1.498947 0.498235 1.889878 3.801759 5.432982 10.041836 13.807444
min 11.000000 0.000000 -1.000000 -1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 -1.000000 -1.000000 4.000000 0.000000 4.000000
25% 24.000000 0.000000 0.000000 -1.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 -1.000000 -1.000000 4.000000 1.000000 6.000000
50% 33.000000 0.000000 0.000000 1.000000 6.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 2.000000 1.000000 2.000000 0.000000 -1.000000 -1.000000 5.000000 3.000000 8.000000
75% 42.750000 0.000000 1.000000 2.000000 8.000000 3.000000 2.000000 0.000000 0.000000 1.000000 1.000000 7.000000 1.000000 3.000000 1.000000 0.000000 4.000000 7.000000 7.000000 15.000000
max 70.000000 1.000000 6.000000 4.000000 11.000000 3.000000 2.000000 2.000000 1.000000 1.000000 3.000000 7.000000 3.000000 3.000000 1.000000 6.000000 10.000000 49.000000 70.000000 102.000000
In [17]:
mask = np.zeros_like(my_data_hm.corr())
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize = (24,16))
sns.heatmap(my_data_hm.corr(), mask=mask, annot=True, cmap="RdYlGn", linewidths=.75)
Out[17]:
<AxesSubplot:>
In [18]:
my_data.describe()
Out[18]:
Age Location Region Number Killed Number Injured Casualties
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 33.741573 5.983146 1.438202 6.910112 6.477528 13.387640
std 12.180403 2.750397 1.284018 5.432982 10.041836 13.807444
min 11.000000 1.000000 0.000000 4.000000 0.000000 4.000000
25% 24.000000 4.000000 0.000000 4.000000 1.000000 6.000000
50% 33.000000 6.000000 1.000000 5.000000 3.000000 8.000000
75% 42.750000 8.000000 3.000000 7.000000 7.000000 15.000000
max 70.000000 11.000000 3.000000 49.000000 70.000000 102.000000

We can see that the average number of casualties is 13, and the average age of the shooter is 34 with a standard deviation of 12 years.

In [19]:
my_data
Out[19]:
Age Gender Race Education Location City State Region Suicidality Voluntary or Involuntary Hospitalization ... Voluntary or Mandatory Counseling Recent or Ongoing Stressor Signs of Being in Crisis Timeline of Signs of Crisis Leakage Leakage How Leakage Who Number Killed Number Injured Casualties
0 25 0.0 0.0 2.0 1 Austin TX 0 2.0 0.0 ... 1 4 1.0 2.0 1.0 0 0 15 31 46
1 18 0.0 0.0 0.0 4 Mesa AZ 3 1.0 0.0 ... 0 0 1.0 3.0 0.0 -1 -1 5 2 7
2 39 0.0 0.0 2.0 9 Lock Haven PA 2 2.0 0.0 ... 0 2 1.0 2.0 0.0 -1 -1 6 6 12
3 56 0.0 0.0 -1 5 Ironwood MI 0 0.0 0.0 ... 0 1 0.0 -1 0.0 -1 -1 7 2 9
4 31 0.0 1.0 2.0 8 Harrisburg PA 2 1.0 0.0 ... 0 2 1.0 0.0 0.0 -1 -1 4 17 21
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
175 21 0.0 4.0 1.0 4 Boulder CO 3 0.0 0.0 ... 0 0 1.0 3.0 0.0 -1 -1 10 1 11
176 44 0.0 2.0 -1 6 Orange CA 3 0.0 0.0 ... 0 0 0.0 -1 0.0 -1 -1 4 1 5
177 19 0.0 0.0 0.0 9 Indianapolis IN 1 1.0 0.0 ... 1 7 1.0 3.0 0.0 -1 -1 8 7 15
178 57 0.0 0.0 2.0 9 San Jose CA 3 1.0 0.0 ... 0 2 1.0 3.0 1.0 6 10 9 0 9
179 15 0.0 0.0 0.0 11 Oxford MI 1 0.0 0.0 ... 0 0 1.0 2.0 1.0 6 10 4 7 11

178 rows × 22 columns

In [20]:
my_data['Casualties'].value_counts()
Out[20]:
6      22
4      20
8      19
7      18
5      15
9      14
11      8
10      7
14      4
20      4
36      4
17      4
16      4
15      4
12      4
21      3
45      3
25      2
46      2
35      1
28      1
49      1
33      1
23      1
34      1
48      1
102     1
82      1
40      1
19      1
13      1
58      1
24      1
29      1
27      1
30      1
Name: Casualties, dtype: int64

This shows us that most shootings in the data set have less than 10 casualties.
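To quantify that claim, a small sketch with made-up counts (in the notebook, my_data['Casualties'] would be the real series):

```python
import pandas as pd

# Illustrative casualty counts only, not the real data.
casualties = pd.Series([6, 4, 8, 12, 5, 9, 46, 7])

# Share of shootings with fewer than 10 total casualties.
share_under_10 = (casualties < 10).mean()
```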

In [21]:
sns.set_theme(style="darkgrid")
sns.countplot(y="Location", data=my_data, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
# countplot sorts the categories, so tick 0 is Location 1 and tick 10 is Location 11
plt.yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
           ['College/university', 'Government building / \nplace of civic importance',
            'House of worship', 'Retail', 'Restaurant/bar/nightclub', 'Office', 'Place of residence',
            'Outdoors', 'Warehouse/factory', 'Post office', 'K-12 school'])
plt.title('Location of Mass Shootings')
plt.show()
In [22]:
fig = px.bar(my_data, x='Casualties',y='Suicidality', height=500, width=600)
fig.update_layout(
    template="seaborn",barmode='stack', xaxis={'categoryorder':'total descending'},
    title='Total Casualties by Suicidality',
    yaxis = dict(
        tickmode = 'array',
        tickvals = [0, 1, 2],
        ticktext = ['No evidence','Yes Prior', 'Not Prior']
    )
)
fig
In [23]:
fig = px.histogram(my_data, x='Location',color='Prior Hospitalization', height=600, width=850)
fig.update_layout(
    template="seaborn",barmode='group', 
    title='Distribution of Prior Hospitalization by Location',
    xaxis = dict(
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        ticktext = ['College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence','Outdoors',
                    'Warehouse/factory', 'Post office', 'K-12 school']
    )
)

fig
  • Prior Hospitalization: 0=No Evidence, 1=Yes.
  • Location: College/university = 1, Government building / place of civic importance = 2, House of worship = 3, Retail = 4, Restaurant/bar/nightclub = 5, Office = 6, Place of residence = 7, Outdoors = 8, Warehouse/factory = 9, Post office = 10, K-12 school = 11.

There are few records of prior hospitalization in the data set. The largest number come from location 4 (Retail), with 6. Location 1 (College/university) is the only location where the majority of cases have a record of prior hospitalization.

In [24]:
fig = px.histogram(my_data, x='Location',color='Voluntary or Involuntary Hospitalization', height=600, width=850)
fig.update_layout(
    template="seaborn",barmode='group',
    title='Distribution of Location by Voluntary or Involuntary Hospitalization',
    xaxis = dict(
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        ticktext = ['College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
                    'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
    )
)
fig
  • Voluntary or Involuntary Hospitalization: 0 = No Evidence, 1 = Voluntary, 2 = Involuntary.
  • Location: College/university = 1, Government building / place of civic importance = 2, House of worship = 3, Retail = 4, Restaurant/bar/nightclub = 5, Office = 6, Place of residence = 7, Outdoors = 8, Warehouse/factory = 9, Post office = 10, K-12 school = 11.

The only locations with a record of prior voluntary hospitalization are Retail, Place of residence, Outdoors, and Warehouse/factory.

In [25]:
fig = px.histogram(my_data, x='Location',color='Suicidality', height=600, width=850)
fig.update_layout(
    template="seaborn",barmode='group',
    title='Distribution of Suicidality by Location',
    xaxis = dict(
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        ticktext = ['College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
                    'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
    )
)
fig
  • Suicidality: 0 = No Evidence, 1 = Yes, at any point before the shooting, 2 = Intended to die in the shooting but had no previous suicidality.
  • Location: College/university = 1, Government building / place of civic importance = 2, House of worship = 3, Retail = 4, Restaurant/bar/nightclub = 5, Office = 6, Place of residence = 7, Outdoors = 8, Warehouse/factory = 9, Post office = 10, K-12 school = 11.

K-12 school shooters were the most likely to have been suicidal prior to the shooting. Warehouse/factory and Retail shooters were generally not suicidal beforehand, but they did intend to die in the shooting.
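A quick way to check readings like this against the raw numbers is a crosstab. This sketch uses toy rows; in the notebook, the same call on my_data would tabulate suicidality codes per location:

```python
import pandas as pd

# Toy rows; the real table would be
# pd.crosstab(my_data['Location'], my_data['Suicidality']).
df = pd.DataFrame({
    'Location':    [11, 11, 4, 4, 9],
    'Suicidality': [1, 1, 2, 2, 0],
})
table = pd.crosstab(df['Location'], df['Suicidality'])
```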

In [26]:
fig = px.histogram(my_data, x='Location',color='Signs of Being in Crisis', height=600, width=850)
fig.update_layout(
    template="seaborn",barmode='group',
    title='Distribution of Signs of Being in Crisis by Location',
    xaxis = dict(
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        ticktext = ['College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
                    'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
    )
)
fig

Shooters at Retail, Restaurant/bar/nightclub, and Office locations most often displayed signs of being in a crisis.

In [27]:
fig = px.histogram(my_data, x='Location',color='Timeline of Signs of Crisis', height=600, width=850)
fig.update_layout(
    template="seaborn",barmode='group', 
    title='Distribution of Timeline of Signs of Crisis by Location',
    xaxis = dict(
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        ticktext = ['College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
                    'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
    )
)
fig

Timeline of Signs of Crisis: N/A = -1, Days before shooting = 0, Weeks before shooting = 1, Months before shooting = 2, Years before shooting = 3.

Shooters of retail locations most often displayed signs of being in a crisis years prior. Shooters of outdoor locations are the most impulsive, showing signs of being in a crisis only days prior to the event.

In [28]:
fig = px.scatter(my_data, x='Location',y='Age', size='Casualties', height=600, width=900)
fig.update_layout(
    template="seaborn",barmode='group',
    title="Distribution of Number of Casualties by Location and Age",
    xaxis = dict(
        tickmode = 'array',
        tickvals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        ticktext = ['College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
                    'Outdoors','Warehouse/factory', 'Post office', 'K-12 school']
    )
)
fig
  • Location: College/university = 1, Government building / place of civic importance = 2, House of worship = 3, Retail = 4, Restaurant/bar/nightclub = 5, Office = 6, Place of residence = 7, Outdoors = 8, Warehouse/factory = 9, Post office = 10, K-12 school = 11.

Most shooters at K-12 schools are under the age of 20.

In [29]:
my_data.loc[(my_data['Prior Hospitalization']==1) & (my_data['Voluntary or Involuntary Hospitalization']==1),
            'Location'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Voluntary Prior Hospitalization')
Out[29]:
<AxesSubplot:title={'center':'Voluntary Prior Hospitalization'}, ylabel='Location'>

There is only one instance of Voluntary Prior Hospitalization in each of four locations: Retail = 4, Place of residence = 7, Outdoors = 8, and Warehouse/factory = 9.

In [30]:
my_data.loc[(my_data['Prior Hospitalization']==1) & (my_data['Voluntary or Involuntary Hospitalization']==2),
            'Location'].value_counts().plot(kind='pie',autopct='%1.1f%%',title='Involuntary Prior Hospitalization')
            
Out[30]:
<AxesSubplot:title={'center':'Involuntary Prior Hospitalization'}, ylabel='Location'>

Retail = 4 and College/university = 1 have the highest percentage (16.7%) of Involuntary Prior Hospitalization.

In [31]:
plt.pie(my_data['Timeline of Signs of Crisis'].value_counts(), 
        labels = ['N/A', 'Days', "Weeks", "Months", "Years"], autopct='%1.1f%%')
plt.title('Timeline of Signs of Crisis')
Out[31]:
Text(0.5, 1.0, 'Timeline of Signs of Crisis')

80% of shooters showed signs of being in a crisis, with almost 25% just days prior to the shooting.
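The percentages behind a pie chart like this come straight from value_counts(normalize=True); illustrative values here (1 = showed signs of crisis, 0 = no evidence):

```python
import pandas as pd

# Illustrative values only; in the notebook this would be
# my_data['Signs of Being in Crisis'].value_counts(normalize=True).
signs = pd.Series([1, 1, 1, 1, 0])
shares = signs.value_counts(normalize=True)
```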

In [32]:
sns.countplot(data=my_data, x="State", hue="Gender")
plt.title('Number of Shootings By State', fontsize=18)
Out[32]:
Text(0.5, 1.0, 'Number of Shootings By State')

The top three states with the most shootings are California, Texas, and Florida.

In [33]:
fig = px.scatter(my_data, x='Recent or Ongoing Stressor',y='Casualties',  height=600, width=850)
fig.update_layout(
    template="seaborn",
    title="Distribution of Number of Casualties by Recent or Ongoing Stressor",
    xaxis = dict(
        tickmode = 'array',
        tickvals = [0, 1, 2, 3, 4, 5, 6, 7],
        ticktext = ['No evidence','Recent break-up','Employment stressor','Economic stressor',
                    'Family issue','Legal issue','Other','Multiple Stressors']
    )
)
fig

Recent or Ongoing Stressors: No evidence = 0, Recent break-up = 1, Employment stressors = 2, Economic stressors = 3, Family issue = 4, Legal issue = 5, Other = 6, Multiple = 7.

Employment stressors are the most common; economic stressors are the least common.


Let's make a data frame of the total number of casualties per state.

In [34]:
state_shootings = my_data[['State', 'Casualties']]
state_shootings = state_shootings.groupby('State').sum().reset_index()
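
As a sanity check, the groupby-and-sum step can be illustrated on a toy frame (the values here are hypothetical, not taken from the database):

```python
import pandas as pd

# Toy example showing how groupby().sum() aggregates per-state
# casualty counts into one row per state (sorted alphabetically).
toy = pd.DataFrame({
    'State': ['TX', 'CA', 'TX', 'CA', 'FL'],
    'Casualties': [10, 5, 7, 3, 4],
})
totals = toy.groupby('State').sum().reset_index()
print(totals)  # one row per state: CA=8, FL=4, TX=17
```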
In [35]:
state_shootings
Out[35]:
State Casualties
0 AK 21
1 AL 5
2 AR 36
3 AZ 30
4 CA 380
5 CO 186
6 CT 42
7 DC 20
8 FL 273
9 GA 41
10 HI 7
11 IA 6
12 ID 4
13 IL 53
14 IN 24
15 KS 7
16 KY 44
17 LA 29
18 MA 7
19 MD 8
20 MI 42
21 MN 24
22 MO 14
23 MS 26
24 NC 50
25 NE 13
26 NH 8
27 NJ 36
28 NV 16
29 NY 87
30 OH 60
31 OK 20
32 OR 77
33 PA 82
34 RI 4
35 SC 16
36 TN 15
37 TX 381
38 UT 9
39 VA 74
40 WA 69
41 WI 37

Now let's plot the total number of casualties per state.

In [36]:
fig = px.choropleth(state_shootings, 
                    locations="State",  # DataFrame column with locations
                    color="Casualties",  # DataFrame column with color values
                    hover_name="State", # DataFrame column hover info
                    locationmode = 'USA-states') # Set to plot as US States
fig.update_layout(title_text = 'Mass Shooting Casualties by State', geo_scope='usa')

fig.show()

We can see that the states with the most casualties are Texas (381), followed by California (380), Florida (273), and Colorado (186). Note: this does not take into account the number of shootings or each state's population.
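
Since raw totals favor populous states, one way to adjust would be a per-capita rate. A minimal sketch, using the casualty totals above and illustrative population figures (rough approximations, not authoritative census values):

```python
# Hypothetical sketch: normalize casualty totals by state population to get a
# per-million rate. Population figures are illustrative placeholders.
populations = {'TX': 29_000_000, 'CA': 39_000_000, 'FL': 21_000_000}
casualties = {'TX': 381, 'CA': 380, 'FL': 273}

per_million = {
    state: casualties[state] / populations[state] * 1_000_000
    for state in casualties
}
for state, rate in sorted(per_million.items(), key=lambda kv: -kv[1]):
    print(f'{state}: {rate:.1f} casualties per million residents')
```

With these rough figures the ranking changes: California's larger population pushes its rate below both Texas and Florida.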

In [37]:
# count the number of shootings per location
my_data.groupby("Location").size().sort_values()
Out[37]:
Location
10     4
1      9
2      9
3     11
8     14
11    14
7     15
6     18
5     25
9     25
4     34
dtype: int64
In [38]:
fig = plt.figure()

# Divide the figure into a 1x2 grid, and give me the first section
ax1 = fig.add_subplot(121)

# Divide the figure into a 1x2 grid, and give me the second section
ax2 = fig.add_subplot(122)

s=my_data.Location.value_counts(normalize=True).mul(100) # mul(100) is == *100
s.index.name,s.name='Location','percentage' #setting the name of index and series
#series.to_frame() returns a dataframe
s.to_frame().plot(kind='bar', ax=ax1, ylim=[0,100])


s=my_data.Race.value_counts(normalize=True).mul(100) # mul(100) is == *100
s.index.name,s.name='Race','percentage' #setting the name of index and series
#series.to_frame() returns a dataframe
s.to_frame().plot(kind='bar', ax = ax2, ylim=[0,100], width=0.15)
Out[38]:
<AxesSubplot:xlabel='Race'>

Retail stores are the most common locations for mass shootings, and the most common race among shooters is White.

Prepare the Data to Model¶

In [39]:
my_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 22 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   Age                                       178 non-null    int64 
 1   Gender                                    178 non-null    object
 2   Race                                      178 non-null    object
 3   Education                                 178 non-null    object
 4   Location                                  178 non-null    int64 
 5   City                                      178 non-null    object
 6   State                                     178 non-null    object
 7   Region                                    178 non-null    int64 
 8   Suicidality                               178 non-null    object
 9   Voluntary or Involuntary Hospitalization  178 non-null    object
 10  Prior Hospitalization                     178 non-null    object
 11  Prior Counseling                          178 non-null    object
 12  Voluntary or Mandatory Counseling         178 non-null    object
 13  Recent or Ongoing Stressor                178 non-null    object
 14  Signs of Being in Crisis                  178 non-null    object
 15  Timeline of Signs of Crisis               178 non-null    object
 16  Leakage                                   178 non-null    object
 17  Leakage How                               178 non-null    object
 18  Leakage Who                               178 non-null    object
 19  Number Killed                             178 non-null    int64 
 20  Number Injured                            178 non-null    int64 
 21  Casualties                                178 non-null    int64 
dtypes: int64(6), object(16)
memory usage: 32.0+ KB
In [40]:
model_df = my_data[['Age', 'Gender', 'Race', 'Location', 'Suicidality', 'Voluntary or Involuntary Hospitalization','Prior Hospitalization', 
            'Prior Counseling', 'Voluntary or Mandatory Counseling', 'Recent or Ongoing Stressor',
            'Signs of Being in Crisis','Timeline of Signs of Crisis', 'Leakage ', 'Leakage How', 
            'Leakage Who ', 'Number Killed', 'Number Injured', 'Casualties']]

Data Binning¶

In [41]:
bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']
model_df['Age_binned'] = pd.cut(model_df['Age'], bins=bin_age, labels=category_age)
model_df = model_df.drop(['Age'], axis = 1)
/var/folders/24/536gs7r91qzd964t2ppqhs6m0000gn/T/ipykernel_31475/2989347415.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
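
The SettingWithCopyWarning above is raised because `model_df` is a slice of `my_data`; appending `.copy()` when selecting the columns (e.g. `model_df = my_data[[...]].copy()`) avoids it. The binning itself can be illustrated on a few hypothetical ages:

```python
import pandas as pd

# Minimal sketch of the age binning above on hypothetical ages.
# pd.cut uses right-inclusive intervals: (0, 19], (19, 29], ..., (69, 80].
ages = pd.Series([15, 23, 45, 67, 79])
bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>60s']
binned = pd.cut(ages, bins=bin_age, labels=category_age)
print(list(binned))  # ['<20s', '20s', '40s', '60s', '>60s']
```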

In [42]:
bin_Number_Killed = [0, 4, 9, 50]
category_Number_Killed = ['Low', 'Medium', 'High']
model_df['Number_Killed_binned'] = pd.cut(model_df['Number Killed'], bins=bin_Number_Killed, labels=category_Number_Killed)
model_df = model_df.drop(['Number Killed'], axis = 1)
In [43]:
bin_Number_Injured = [0, 9, 29, 70]
category_Number_Injured = ['Low', 'Medium', 'High']
model_df['Number_Injured_binned'] = pd.cut(model_df['Number Injured'], bins=bin_Number_Injured, labels=category_Number_Injured)
model_df = model_df.drop(['Number Injured'], axis = 1)
In [44]:
bin_Casualties = [0, 14, 44, 110]
category_Casualties = ['Low', 'Medium', 'High']
model_df['Casualties_binned'] = pd.cut(model_df['Casualties'], bins=bin_Casualties, labels=category_Casualties)
model_df = model_df.drop(['Casualties'], axis = 1)

Set the Dummy Variables¶

We need to separate the response variable from the predictor variables.

In [45]:
X = model_df.drop(["Location"], axis=1)
y = model_df["Location"]
In [46]:
X = pd.get_dummies(X)
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/pandas/core/algorithms.py:798: FutureWarning:

In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)
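
How `pd.get_dummies` expands each categorical column into one indicator column per category can be seen on a tiny hypothetical frame:

```python
import pandas as pd

# Each categorical column becomes one indicator column per category,
# named <column>_<category>, with categories in sorted order.
df = pd.DataFrame({'Gender': ['M', 'F', 'M'],
                   'Race': ['White', 'Black', 'White']})
dummies = pd.get_dummies(df)
print(list(dummies.columns))
# ['Gender_F', 'Gender_M', 'Race_Black', 'Race_White']
```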

Split the Data¶

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

Because we are working with a small data set, we will split the data into a 33% test sample and a 67% training sample.

In [48]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
In [49]:
print('The Shape Of The Original Data: ', model_df.shape)
print('The Shape Of x_test: ', x_test.shape)
print('The Shape Of x_train: ', x_train.shape)
print('The Shape Of y_test: ', y_test.shape)
print('The Shape Of y_train: ', y_train.shape)
The Shape Of The Original Data:  (178, 18)
The Shape Of x_test:  (59, 78)
The Shape Of x_train:  (119, 78)
The Shape Of y_test:  (59,)
The Shape Of y_train:  (119,)

This confirms that the test sample is 33% of the original data set.
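
A quick arithmetic check that the reported shapes match the requested split:

```python
# The reported shapes: 178 total rows, 59 test rows, 119 training rows.
n_total, n_test, n_train = 178, 59, 119
assert n_test + n_train == n_total
print(f'test share: {n_test / n_total:.2%}')  # test share: 33.15%
```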

Balance the Training Set Using the SMOTE Technique¶

SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique, but it works differently than classic oversampling.

In a classic oversampling technique, samples are duplicated from the minority population. While this increases the amount of data, it does not give any new information or variation to the machine learning model.

SMOTE instead uses a k-nearest-neighbors algorithm to create synthetic data. It starts by choosing a random sample from the minority class and finding its k nearest minority-class neighbors. A synthetic sample is then generated at a random point between the chosen sample and one of those neighbors.
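
The interpolation step described above can be sketched in plain NumPy. This is an illustrative sketch only, not imblearn's actual implementation; `smote_sample` and the toy points are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(minority_points, k=1):
    """Generate one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbors."""
    i = rng.integers(len(minority_points))
    base = minority_points[i]
    # distances from the base point to every other minority point
    dists = np.linalg.norm(minority_points - base, axis=1)
    dists[i] = np.inf  # exclude the base point itself
    neighbor = minority_points[rng.choice(np.argsort(dists)[:k])]
    # place the synthetic point a random fraction of the way to the neighbor
    gap = rng.random()
    return base + gap * (neighbor - base)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_sample(minority)
print(synthetic)  # lies on a segment between two of the original points
```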

In [50]:
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
plt.yticks([0, 1,2,3,4,5,6,7,8,9,10,], [ 'K-12 school','College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
                    'Outdoors','Warehouse/factory', 'Post office'])
plt.title('Unbalanced Data')
plt.show()

This graph shows us that the training data set is not balanced.

In [51]:
from imblearn.over_sampling import SMOTE
x_train, y_train = SMOTE(k_neighbors=1).fit_resample(x_train, y_train)
In [52]:
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
plt.yticks([0, 1,2,3,4,5,6,7,8,9,10,], [ 'K-12 school','College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
                    'Outdoors','Warehouse/factory', 'Post office'])
plt.title('Balanced Data')
plt.show()

This shows that the training set has been balanced across the Location classes.

Models¶

Logistic Regression¶

In [53]:
from sklearn.linear_model import LogisticRegression

LRclassifier = LogisticRegression(solver='liblinear', max_iter=5000)
LRclassifier.fit(x_train, y_train)

y_pred = LRclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

from sklearn.metrics import accuracy_score

LRAcc = accuracy_score(y_pred,y_test)
print('Logistic Regression accuracy is: {:.2f}%'.format(LRAcc*100))
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.24      0.62      0.34         8
           5       0.13      0.33      0.19         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.50      0.17      0.25         6
           9       0.43      0.43      0.43         7
          11       0.43      0.75      0.55         4

    accuracy                           0.24        59
   macro avg       0.17      0.23      0.18        59
weighted avg       0.18      0.24      0.18        59

[[0 0 0 2 1 0 0 0 0 1]
 [1 0 0 3 1 0 0 0 0 0]
 [1 0 0 1 1 0 2 1 0 1]
 [0 0 0 5 3 0 0 0 0 0]
 [0 0 1 2 2 0 0 0 0 1]
 [0 1 0 3 3 0 0 0 3 0]
 [0 0 0 1 0 0 0 0 1 0]
 [0 0 0 2 3 0 0 1 0 0]
 [0 0 0 2 0 1 0 0 3 1]
 [0 0 0 0 1 0 0 0 0 3]]
Logistic Regression accuracy is: 23.73%

K-Nearest Neighbors¶

In [54]:
from sklearn.neighbors import KNeighborsClassifier

KNclassifier = KNeighborsClassifier(n_neighbors=20)
KNclassifier.fit(x_train, y_train)

y_pred = KNclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

KNAcc = accuracy_score(y_pred,y_test)
print('K Neighbours accuracy is: {:.2f}%'.format(KNAcc*100))
              precision    recall  f1-score   support

           1       1.00      0.25      0.40         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.17      0.50      0.25         8
           5       0.12      0.33      0.18         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.25      0.43      0.32         7
          11       0.50      0.50      0.50         4

    accuracy                           0.20        59
   macro avg       0.20      0.20      0.16        59
weighted avg       0.17      0.20      0.15        59

[[1 0 0 1 0 0 0 0 1 1]
 [0 0 0 3 1 0 0 0 1 0]
 [0 0 0 2 1 0 2 0 1 1]
 [0 0 0 4 3 0 0 0 1 0]
 [0 0 0 4 2 0 0 0 0 0]
 [0 0 0 2 5 0 0 0 3 0]
 [0 0 0 0 0 0 0 0 2 0]
 [0 0 0 3 3 0 0 0 0 0]
 [0 0 0 3 1 0 0 0 3 0]
 [0 0 0 2 0 0 0 0 0 2]]
K Neighbours accuracy is: 20.34%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

Support Vector Machine (SVM)¶

In [55]:
from sklearn.svm import SVC

SVCclassifier = SVC(kernel='linear', max_iter=251)
SVCclassifier.fit(x_train, y_train)

y_pred = SVCclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

SVCAcc = accuracy_score(y_pred,y_test)
print('SVC accuracy is: {:.2f}%'.format(SVCAcc*100))
              precision    recall  f1-score   support

           1       0.50      0.75      0.60         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.22      0.50      0.31         8
           5       0.09      0.17      0.12         6
           6       0.20      0.10      0.13        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.50      0.43      0.46         7
          10       0.00      0.00      0.00         0
          11       0.43      0.75      0.55         4

    accuracy                           0.25        59
   macro avg       0.18      0.25      0.20        59
weighted avg       0.20      0.25      0.21        59

[[3 0 0 0 0 0 0 0 0 0 1]
 [2 0 0 2 1 0 0 0 0 0 0]
 [1 0 0 1 0 1 2 0 1 0 1]
 [0 0 1 4 2 1 0 0 0 0 0]
 [0 0 1 2 1 2 0 0 0 0 0]
 [0 1 0 3 2 1 0 0 2 0 1]
 [0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 2 4 0 0 0 0 0 0]
 [0 0 0 2 0 0 0 0 3 1 1]
 [0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 3]]
SVC accuracy is: 25.42%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/svm/_base.py:301: ConvergenceWarning:

Solver terminated early (max_iter=251).  Consider pre-processing your data with StandardScaler or MinMaxScaler.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

Naive Bayes¶

In [56]:
from sklearn.naive_bayes import GaussianNB

NBclassifier2 = GaussianNB()
NBclassifier2.fit(x_train, y_train)

y_pred = NBclassifier2.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

NBAcc2 = accuracy_score(y_pred,y_test)
print('Gaussian Naive Bayes accuracy is: {:.2f}%'.format(NBAcc2*100))
              precision    recall  f1-score   support

           1       0.33      0.25      0.29         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.10      0.25      0.14         8
           5       0.10      0.33      0.15         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.43      0.43      0.43         7
          10       0.00      0.00      0.00         0
          11       0.00      0.00      0.00         4

    accuracy                           0.14        59
   macro avg       0.09      0.11      0.09        59
weighted avg       0.10      0.14      0.10        59

[[1 0 0 1 1 0 0 0 0 0 1]
 [0 0 0 4 1 0 0 0 0 0 0]
 [1 0 0 1 5 0 0 0 0 0 0]
 [0 1 0 2 3 0 0 0 1 1 0]
 [1 0 0 2 2 0 0 0 1 0 0]
 [0 1 0 3 3 0 0 0 2 1 0]
 [0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 2 4 0 0 0 0 0 0]
 [0 0 0 1 0 0 2 1 3 0 0]
 [0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 3 1 0 0 0 0 0 0]]
Gaussian Naive Bayes accuracy is: 13.56%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

Decision Tree¶

In [57]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

DTclassifier = DecisionTreeClassifier(max_leaf_nodes=20)
DTclassifier = DTclassifier.fit(x_train, y_train)

y_pred = DTclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

DTAcc = accuracy_score(y_pred,y_test)
print('Decision Tree accuracy is: {:.2f}%'.format(DTAcc*100))
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.15      0.62      0.24         8
           5       0.08      0.17      0.11         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.50      0.29      0.36         7
          10       0.00      0.00      0.00         0
          11       0.60      0.75      0.67         4

    accuracy                           0.19        59
   macro avg       0.12      0.17      0.13        59
weighted avg       0.13      0.19      0.13        59

[[0 0 0 4 0 0 0 0 0 0 0]
 [0 0 0 2 2 0 0 0 0 1 0]
 [0 0 0 5 1 0 0 0 1 0 0]
 [0 0 0 5 2 0 0 0 0 0 1]
 [0 0 0 4 1 0 0 0 0 1 0]
 [0 0 0 5 3 0 0 0 1 1 0]
 [0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 3 3 0 0 0 0 0 0]
 [0 0 0 2 1 0 0 0 2 1 1]
 [0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 3]]
Decision Tree accuracy is: 18.64%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

Plot the Decision Tree¶

In [58]:
fig, axes = plt.subplots(nrows = 1,ncols = 1, figsize = (20,20), dpi=600)
tree.plot_tree(DTclassifier, max_depth = 20, feature_names = X.columns, filled=True)
plt.show()

Decision trees place the splits with the lowest entropy (highest information gain) at the top of the tree, meaning these features yield significantly more information than the other features.
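
This ranking can be illustrated with a direct entropy computation: a split that separates the classes well produces pure children and therefore a large information gain. A minimal sketch on hypothetical labels:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Hypothetical parent node with a 50/50 class mix: maximally impure.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(entropy(parent))  # 1.0

# A perfect split produces pure children, so the information gain
# (parent entropy minus weighted child entropies) is the full 1.0 bit.
left, right = parent[:4], parent[4:]
gain = entropy(parent) - 0.5 * entropy(left) - 0.5 * entropy(right)
print(gain)  # 1.0
```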

Test for Feature Importance¶

We can verify this further by creating a feature importance plot.

In [59]:
fi = DTclassifier.feature_importances_ #feature importance array
fi = pd.Series(data = fi, index = X.columns) #convert to Pandas series for plotting
fi.sort_values(ascending=False, inplace=True) #sort descending
In [60]:
#create bar plot
plt.figure(figsize=(20, 20))
chart = sns.barplot(x=fi, y=fi.index, palette=sns.color_palette("BuGn_r", n_colors=len(fi)))
chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.show()
/var/folders/24/536gs7r91qzd964t2ppqhs6m0000gn/T/ipykernel_31475/2241717128.py:4: UserWarning:

FixedFormatter should only be used together with FixedLocator

This graph shows us the most important features in our model. In future models, we should drop the unimportant variables.
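
One simple way to act on this is to keep only the features whose importance exceeds a small threshold. A hedged sketch, with `fi` standing in for the importance series computed above (the values here are illustrative, not the actual fitted importances):

```python
import pandas as pd

# Illustrative stand-in for the sorted feature-importance series `fi`.
fi = pd.Series({'Race_1.0': 0.30, 'Age_binned_20s': 0.22,
                'Leakage _1.0': 0.10, 'Gender_0.0': 0.0, 'Race_5.0': 0.0})

# Keep features with importance above the threshold; the rest can be dropped.
important = fi[fi > 0.01].index.tolist()
print(important)  # ['Race_1.0', 'Age_binned_20s', 'Leakage _1.0']
```

The surviving column names could then be used to subset `X` (e.g. `X[important]`) before refitting.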

Random Forest¶

In [61]:
from sklearn.ensemble import RandomForestClassifier

RFclassifier = RandomForestClassifier(max_leaf_nodes=30)
RFclassifier.fit(x_train, y_train)

y_pred = RFclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

RFAcc = accuracy_score(y_pred,y_test)
print('Random Forest accuracy is: {:.2f}%'.format(RFAcc*100))
              precision    recall  f1-score   support

           1       0.50      0.25      0.33         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.17      0.25      0.20         8
           5       0.08      0.17      0.11         6
           6       0.17      0.10      0.12        10
           7       0.00      0.00      0.00         2
           8       0.33      0.17      0.22         6
           9       0.43      0.43      0.43         7
          11       0.38      0.75      0.50         4

    accuracy                           0.20        59
   macro avg       0.20      0.21      0.19        59
weighted avg       0.20      0.20      0.19        59

[[1 0 0 0 0 1 1 0 0 1]
 [1 0 0 2 1 0 0 0 1 0]
 [0 0 0 0 0 1 5 0 0 1]
 [0 0 0 2 4 1 0 0 0 1]
 [0 0 0 2 1 1 0 1 0 1]
 [0 1 0 2 3 1 0 0 3 0]
 [0 0 0 1 0 0 0 1 0 0]
 [0 0 0 0 4 0 1 1 0 0]
 [0 0 0 2 0 1 0 0 3 1]
 [0 0 0 1 0 0 0 0 0 3]]
Random Forest accuracy is: 20.34%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

Model Comparison¶

In [62]:
compare = pd.DataFrame({'Model': ['Logistic Regression', 'K Neighbors', 'SVM', 'Gaussian NB', 'Decision Tree', 'Random Forest'], 
                        'Accuracy': [LRAcc*100, KNAcc*100, SVCAcc*100, NBAcc2*100, DTAcc*100, RFAcc*100]})
compare.sort_values(by='Accuracy', ascending=False)
Out[62]:
Model Accuracy
2 SVM 25.423729
0 Logistic Regression 23.728814
1 K Neighbors 20.338983
5 Random Forest 20.338983
4 Decision Tree 18.644068
3 Gaussian NB 13.559322


In [63]:
sns.set_theme(style="darkgrid")
sns.barplot(data=compare.sort_values(by='Accuracy', ascending=False), x='Model', y='Accuracy', palette="mako_r")
plt.ylabel('Accuracy Percentage')
plt.xlabel('Model')
plt.title('Model Accuracy')
plt.show()
In [64]:
fig = px.bar(compare.sort_values(by='Accuracy', ascending=False), x='Model', y='Accuracy')
fig.update_layout(
    template="seaborn", xaxis={'categoryorder':'total descending'},
    title='Model Accuracy')
fig.show()

Models - Part 2¶

Drop the unimportant Features¶

Let's run the models again, this time dropping the unimportant features.

In [65]:
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 78 columns):
 #   Column                                        Non-Null Count  Dtype
---  ------                                        --------------  -----
 0   Gender_0.0                                    178 non-null    uint8
 1   Gender_1.0                                    178 non-null    uint8
 2   Race_0.0                                      178 non-null    uint8
 3   Race_1.0                                      178 non-null    uint8
 4   Race_2.0                                      178 non-null    uint8
 5   Race_3.0                                      178 non-null    uint8
 6   Race_4.0                                      178 non-null    uint8
 7   Race_5.0                                      178 non-null    uint8
 8   Race_6.0                                      178 non-null    uint8
 9   Race_-1                                       178 non-null    uint8
 10  Suicidality_0.0                               178 non-null    uint8
 11  Suicidality_1.0                               178 non-null    uint8
 12  Suicidality_2.0                               178 non-null    uint8
 13  Voluntary or Involuntary Hospitalization_0.0  178 non-null    uint8
 14  Voluntary or Involuntary Hospitalization_1.0  178 non-null    uint8
 15  Voluntary or Involuntary Hospitalization_2.0  178 non-null    uint8
 16  Prior Hospitalization_0.0                     178 non-null    uint8
 17  Prior Hospitalization_1.0                     178 non-null    uint8
 18  Prior Counseling_0.0                          178 non-null    uint8
 19  Prior Counseling_1.0                          178 non-null    uint8
 20  Voluntary or Mandatory Counseling_0           178 non-null    uint8
 21  Voluntary or Mandatory Counseling_1           178 non-null    uint8
 22  Voluntary or Mandatory Counseling_2           178 non-null    uint8
 23  Voluntary or Mandatory Counseling_3           178 non-null    uint8
 24  Recent or Ongoing Stressor_0                  178 non-null    uint8
 25  Recent or Ongoing Stressor_1                  178 non-null    uint8
 26  Recent or Ongoing Stressor_2                  178 non-null    uint8
 27  Recent or Ongoing Stressor_3                  178 non-null    uint8
 28  Recent or Ongoing Stressor_4                  178 non-null    uint8
 29  Recent or Ongoing Stressor_5                  178 non-null    uint8
 30  Recent or Ongoing Stressor_6                  178 non-null    uint8
 31  Recent or Ongoing Stressor_7                  178 non-null    uint8
 32  Signs of Being in Crisis_0.0                  178 non-null    uint8
 33  Signs of Being in Crisis_1.0                  178 non-null    uint8
 34  Signs of Being in Crisis_3.0                  178 non-null    uint8
 35  Timeline of Signs of Crisis_0.0               178 non-null    uint8
 36  Timeline of Signs of Crisis_1.0               178 non-null    uint8
 37  Timeline of Signs of Crisis_2.0               178 non-null    uint8
 38  Timeline of Signs of Crisis_3.0               178 non-null    uint8
 39  Timeline of Signs of Crisis_-1                178 non-null    uint8
 40  Leakage _0.0                                  178 non-null    uint8
 41  Leakage _1.0                                  178 non-null    uint8
 42  Leakage How_0                                 178 non-null    uint8
 43  Leakage How_1                                 178 non-null    uint8
 44  Leakage How_2                                 178 non-null    uint8
 45  Leakage How_3                                 178 non-null    uint8
 46  Leakage How_4                                 178 non-null    uint8
 47  Leakage How_5                                 178 non-null    uint8
 48  Leakage How_6                                 178 non-null    uint8
 49  Leakage How_-1                                178 non-null    uint8
 50  Leakage Who _10                               178 non-null    uint8
 51  Leakage Who _-1                               178 non-null    uint8
 52  Leakage Who _0                                178 non-null    uint8
 53  Leakage Who _1                                178 non-null    uint8
 54  Leakage Who _2                                178 non-null    uint8
 55  Leakage Who _3                                178 non-null    uint8
 56  Leakage Who _4                                178 non-null    uint8
 57  Leakage Who _5                                178 non-null    uint8
 58  Leakage Who _6                                178 non-null    uint8
 59  Leakage Who _7                                178 non-null    uint8
 60  Leakage Who _8                                178 non-null    uint8
 61  Leakage Who _9                                178 non-null    uint8
 62  Age_binned_<20s                               178 non-null    uint8
 63  Age_binned_20s                                178 non-null    uint8
 64  Age_binned_30s                                178 non-null    uint8
 65  Age_binned_40s                                178 non-null    uint8
 66  Age_binned_50s                                178 non-null    uint8
 67  Age_binned_60s                                178 non-null    uint8
 68  Age_binned_>60s                               178 non-null    uint8
 69  Number_Killed_binned_Low                      178 non-null    uint8
 70  Number_Killed_binned_Medium                   178 non-null    uint8
 71  Number_Killed_binned_High                     178 non-null    uint8
 72  Number_Injured_binned_Low                     178 non-null    uint8
 73  Number_Injured_binned_Medium                  178 non-null    uint8
 74  Number_Injured_binned_High                    178 non-null    uint8
 75  Casualties_binned_Low                         178 non-null    uint8
 76  Casualties_binned_Medium                      178 non-null    uint8
 77  Casualties_binned_High                        178 non-null    uint8
dtypes: uint8(78)
memory usage: 14.9 KB
In [66]:
X = X[['Age_binned_<20s', 'Voluntary or Mandatory Counseling_1', 'Casualties_binned_Low', 'Leakage How_-1',
        'Suicidality_1.0', 'Number_Killed_binned_High', 'Age_binned_20s', 'Suicidality_2.0', 'Recent or Ongoing Stressor_7',
        'Voluntary or Involuntary Hospitalization_2.0', 'Number_Injured_binned_Low', 'Race_-1', 'Timeline of Signs of Crisis_3.0',
        'Gender_1.0', 'Recent or Ongoing Stressor_3', 'Race_1.0', 'Voluntary or Involuntary Hospitalization_0.0']]
In [67]:
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178 entries, 0 to 179
Data columns (total 17 columns):
 #   Column                                        Non-Null Count  Dtype
---  ------                                        --------------  -----
 0   Age_binned_<20s                               178 non-null    uint8
 1   Voluntary or Mandatory Counseling_1           178 non-null    uint8
 2   Casualties_binned_Low                         178 non-null    uint8
 3   Leakage How_-1                                178 non-null    uint8
 4   Suicidality_1.0                               178 non-null    uint8
 5   Number_Killed_binned_High                     178 non-null    uint8
 6   Age_binned_20s                                178 non-null    uint8
 7   Suicidality_2.0                               178 non-null    uint8
 8   Recent or Ongoing Stressor_7                  178 non-null    uint8
 9   Voluntary or Involuntary Hospitalization_2.0  178 non-null    uint8
 10  Number_Injured_binned_Low                     178 non-null    uint8
 11  Race_-1                                       178 non-null    uint8
 12  Timeline of Signs of Crisis_3.0               178 non-null    uint8
 13  Gender_1.0                                    178 non-null    uint8
 14  Recent or Ongoing Stressor_3                  178 non-null    uint8
 15  Race_1.0                                      178 non-null    uint8
 16  Voluntary or Involuntary Hospitalization_0.0  178 non-null    uint8
dtypes: uint8(17)
memory usage: 4.3 KB

Split the data¶

In [68]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 0)
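With 178 rows spread over eleven location classes, a plain random split can leave the rarest classes nearly absent from one side. `train_test_split` accepts a `stratify` argument that preserves class proportions (it requires at least two samples per class). A minimal sketch on toy labels, shown as an assumption about how this step could be tightened rather than what the notebook does:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): three classes with uneven counts.
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 10 + [1] * 6 + [2] * 4)

# stratify keeps each class's share similar in train and test;
# every class needs at least 2 samples for this to work.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)
print(np.bincount(y_tr), np.bincount(y_te))
```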
In [69]:
print('The Shape Of The Original Data: ', model_df.shape)
print('The Shape Of x_test: ', x_test.shape)
print('The Shape Of x_train: ', x_train.shape)
print('The Shape Of y_test: ', y_test.shape)
print('The Shape Of y_train: ', y_train.shape)
The Shape Of The Original Data:  (178, 18)
The Shape Of x_test:  (59, 17)
The Shape Of x_train:  (119, 17)
The Shape Of y_test:  (59,)
The Shape Of y_train:  (119,)

Balance the data¶

In [70]:
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, palette="mako_r")  # data= is redundant when y is passed directly
plt.ylabel('Location')
plt.xlabel('Total')
plt.yticks(range(11), ['K-12 school', 'College/university',
                       'Government building / \nplace of civic importance',
                       'House of worship', 'Retail', 'Restaurant/bar/nightclub',
                       'Office', 'Place of residence', 'Outdoors',
                       'Warehouse/factory', 'Post office'])
plt.title('Unbalanced Data')
plt.show()
In [71]:
x_train, y_train = SMOTE(k_neighbors=1).fit_resample(x_train, y_train)
In [72]:
sns.set_theme(style="darkgrid")
sns.countplot(y=y_train, data=model_df, palette="mako_r")
plt.ylabel('Location')
plt.xlabel('Total')
plt.yticks([0, 1,2,3,4,5,6,7,8,9,10,], [ 'K-12 school','College/university','Government building / \nplace of civic importance',
                    'House of worship','Retail','Restaurant/bar/nightclub','Office','Place of residence',
                    'Outdoors','Warehouse/factory', 'Post office'])
plt.title('Balanced Data')
plt.show()

Logistic Regression¶

In [73]:
LRclassifier = LogisticRegression(solver='liblinear', max_iter=5000)
LRclassifier.fit(x_train, y_train)

y_pred = LRclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

from sklearn.metrics import accuracy_score

LRAcc = accuracy_score(y_test, y_pred)
print('Logistic Regression accuracy is: {:.2f}%'.format(LRAcc*100))
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         4
           2       0.20      0.20      0.20         5
           3       0.00      0.00      0.00         7
           4       0.17      0.12      0.14         8
           5       0.22      0.33      0.27         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.20      0.57      0.30         7
          11       0.30      0.75      0.43         4

    accuracy                           0.19        59
   macro avg       0.11      0.20      0.13        59
weighted avg       0.11      0.19      0.13        59

[[0 0 0 1 0 0 0 0 1 2]
 [0 1 0 0 1 0 1 0 1 1]
 [0 1 0 0 0 0 1 0 3 2]
 [1 0 0 1 3 0 0 0 3 0]
 [0 1 0 2 2 0 0 0 1 0]
 [0 1 0 1 1 0 2 0 5 0]
 [0 0 0 1 0 0 0 0 1 0]
 [0 1 0 0 2 1 1 0 1 0]
 [0 0 0 0 0 0 1 0 4 2]
 [0 0 0 0 0 0 1 0 0 3]]
Logistic Regression accuracy is: 18.64%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
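These UndefinedMetricWarning messages appear because some location classes are never predicted, so their precision is 0/0. As the warning itself suggests, `zero_division` controls this. A minimal sketch on toy labels (hypothetical, not the notebook's data):

```python
from sklearn.metrics import classification_report

# Toy case: class 2 is never predicted, so its precision is 0/0;
# zero_division=0 reports it as 0.0 without emitting a warning.
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 1, 0, 1, 0]
print(classification_report(y_true, y_pred, zero_division=0))
```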

K-Nearest Neighbors¶

In [74]:
KNclassifier = KNeighborsClassifier(n_neighbors=20)
KNclassifier.fit(x_train, y_train)

y_pred = KNclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

KNAcc = accuracy_score(y_test, y_pred)
print('K Neighbours accuracy is: {:.2f}%'.format(KNAcc*100))
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         4
           2       0.11      0.20      0.14         5
           3       0.00      0.00      0.00         7
           4       0.00      0.00      0.00         8
           5       0.11      0.33      0.16         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.11      0.14      0.12         7
          11       0.40      0.50      0.44         4

    accuracy                           0.10        59
   macro avg       0.07      0.12      0.09        59
weighted avg       0.06      0.10      0.07        59

[[0 0 0 0 1 1 0 0 1 1]
 [0 1 0 2 1 0 0 0 1 0]
 [0 2 0 1 1 1 1 0 1 0]
 [0 0 0 0 3 1 1 1 2 0]
 [0 3 0 0 2 0 0 0 1 0]
 [0 2 0 0 5 0 1 0 2 0]
 [0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 5 0 1 0 0 0]
 [0 0 0 0 1 0 3 0 1 2]
 [0 1 0 0 0 0 0 1 0 2]]
K Neighbours accuracy is: 10.17%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

SVM Model¶

In [75]:
SVCclassifier = SVC(kernel='linear', max_iter=251)  # low iteration cap; the solver warns below that it stops early
SVCclassifier.fit(x_train, y_train)

y_pred = SVCclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

SVCAcc = accuracy_score(y_test, y_pred)
print('SVC accuracy is: {:.2f}%'.format(SVCAcc*100))
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.17      0.38      0.23         8
           5       0.50      0.17      0.25         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.15      0.57      0.24         7
          11       0.50      0.75      0.60         4

    accuracy                           0.19        59
   macro avg       0.13      0.19      0.13        59
weighted avg       0.12      0.19      0.13        59

[[0 0 0 1 0 0 0 0 2 1]
 [0 0 0 3 0 0 0 0 2 0]
 [0 0 0 1 0 0 1 0 4 1]
 [0 0 0 3 0 1 1 0 3 0]
 [0 0 0 3 1 0 1 0 1 0]
 [0 1 0 2 0 0 0 0 7 0]
 [0 0 0 2 0 0 0 0 0 0]
 [0 0 0 1 1 0 1 0 3 0]
 [0 0 0 2 0 0 0 0 4 1]
 [0 0 0 0 0 0 0 0 1 3]]
SVC accuracy is: 18.64%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/svm/_base.py:301: ConvergenceWarning:

Solver terminated early (max_iter=251).  Consider pre-processing your data with StandardScaler or MinMaxScaler.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

2 Gaussian Naive Bayes¶

In [76]:
NBclassifier2 = GaussianNB()
NBclassifier2.fit(x_train, y_train)

y_pred = NBclassifier2.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

NBAcc2 = accuracy_score(y_test, y_pred)
print('Gaussian Naive Bayes accuracy is: {:.2f}%'.format(NBAcc2*100))
              precision    recall  f1-score   support

           1       0.00      0.00      0.00         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.11      0.38      0.17         8
           5       0.00      0.00      0.00         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.21      0.71      0.32         7
          10       0.00      0.00      0.00         0
          11       0.00      0.00      0.00         4

    accuracy                           0.14        59
   macro avg       0.03      0.10      0.04        59
weighted avg       0.04      0.14      0.06        59

[[0 0 0 3 0 0 0 0 1 0 0]
 [0 0 0 2 0 0 1 0 2 0 0]
 [0 0 0 3 0 0 1 0 2 1 0]
 [0 0 0 3 0 0 0 0 4 1 0]
 [0 0 0 5 0 0 0 0 1 0 0]
 [0 0 0 2 0 0 0 0 6 2 0]
 [0 0 0 2 0 0 0 0 0 0 0]
 [0 0 0 2 0 0 1 0 3 0 0]
 [0 0 0 2 0 0 0 0 5 0 0]
 [0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 3 0 0 1 0 0 0 0]]
Gaussian Naive Bayes accuracy is: 13.56%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

Decision Tree¶

In [77]:
DTclassifier = DecisionTreeClassifier(max_leaf_nodes=20)
DTclassifier = DTclassifier.fit(x_train, y_train)

y_pred = DTclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

DTAcc = accuracy_score(y_test, y_pred)
print('Decision Tree accuracy is: {:.2f}%'.format(DTAcc*100))
              precision    recall  f1-score   support

           1       0.22      0.50      0.31         4
           2       0.00      0.00      0.00         5
           3       0.33      0.14      0.20         7
           4       0.00      0.00      0.00         8
           5       0.00      0.00      0.00         6
           6       0.20      0.10      0.13        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.00      0.00      0.00         7
          10       0.00      0.00      0.00         0
          11       0.60      0.75      0.67         4

    accuracy                           0.12        59
   macro avg       0.12      0.14      0.12        59
weighted avg       0.13      0.12      0.11        59

[[2 0 0 0 1 0 0 0 1 0 0]
 [1 0 0 0 1 1 1 0 1 0 0]
 [0 0 1 0 1 1 0 0 3 1 0]
 [0 0 1 0 1 1 0 0 4 0 1]
 [4 0 0 0 0 1 1 0 0 0 0]
 [1 1 0 0 1 1 0 0 4 2 0]
 [1 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 3 0 0 0 3 0 0]
 [0 0 0 0 1 0 3 0 0 2 1]
 [0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 3]]
Decision Tree accuracy is: 11.86%
/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/jeremysloan/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1327: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

Plot the Decision Tree¶

In [78]:
fig, axes = plt.subplots(nrows = 1,ncols = 1, figsize = (20,20), dpi=600)
tree.plot_tree(DTclassifier, max_depth = 20, feature_names = X.columns, filled=True)
plt.show()

Test for important features¶

In [79]:
fi = DTclassifier.feature_importances_ #feature importance array
fi = pd.Series(data = fi, index = X.columns) #convert to Pandas series for plotting
fi.sort_values(ascending=False, inplace=True) #sort descending
In [80]:
#create bar plot
plt.figure(figsize=(25, 20))
sns.barplot(x=fi, y=fi.index, palette=sns.color_palette("mako_r", n_colors=len(fi)))
plt.xticks(rotation=45, ha='right')  # rotate tick labels without re-setting the formatter
plt.show()

Random Forest¶

In [81]:
RFclassifier = RandomForestClassifier(max_leaf_nodes=30)
RFclassifier.fit(x_train, y_train)

y_pred = RFclassifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

RFAcc = accuracy_score(y_pred,y_test)
print('Random Forest accuracy is: {:.2f}%'.format(RFAcc*100))
              precision    recall  f1-score   support

           1       0.14      0.25      0.18         4
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         7
           4       0.25      0.12      0.17         8
           5       0.11      0.17      0.13         6
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         6
           9       0.07      0.14      0.10         7
          11       0.38      0.75      0.50         4

    accuracy                           0.12        59
   macro avg       0.10      0.14      0.11        59
weighted avg       0.09      0.12      0.09        59

[[1 0 1 0 0 0 0 0 1 1]
 [2 0 1 0 0 0 0 0 2 0]
 [0 0 0 2 0 0 1 0 3 1]
 [0 0 0 1 2 0 0 2 2 1]
 [3 0 0 0 1 0 0 1 1 0]
 [1 1 0 0 2 0 2 0 4 0]
 [0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 4 0 2 0 0 0]
 [0 0 0 0 0 0 4 0 1 2]
 [0 0 0 0 0 0 0 1 0 3]]
Random Forest accuracy is: 11.86%

Comparison¶

In [82]:
compare = pd.DataFrame({'Model': ['Logistic Regression', 'K Neighbors', 'SVM', 'Gaussian NB', 'Decision Tree', 'Random Forest'], 
                        'Accuracy': [LRAcc*100, KNAcc*100, SVCAcc*100, NBAcc2*100, DTAcc*100, RFAcc*100]})
compare.sort_values(by='Accuracy', ascending=False)
Out[82]:
Model Accuracy
0 Logistic Regression 18.644068
2 SVM 18.644068
3 Gaussian NB 13.559322
4 Decision Tree 11.864407
5 Random Forest 11.864407
1 K Neighbors 10.169492
In [83]:
sns.set_theme(style="darkgrid")
sns.barplot(data=compare.sort_values(by='Accuracy', ascending=False), x='Model', y='Accuracy', palette="mako_r")
plt.ylabel('Accuracy Percentage')
plt.xlabel('Model')
plt.title('Important Feature Model Accuracy')
plt.show()

The models are noticeably less accurate after dropping the features flagged as unimportant, which suggests the discarded features still carried some predictive signal.
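With only 59 test rows, single-split accuracies like these carry wide error bars, so the ranking between models may not be stable. One hedged alternative is k-fold cross-validation, sketched here on stand-in data (`X_toy` and `y_toy` are hypothetical; in the notebook they would be `X` and `y`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data of the same row count as the notebook's frame.
X_toy, y_toy = make_classification(n_samples=178, n_classes=3,
                                   n_informative=5, random_state=0)

# 5-fold CV: one accuracy per fold, then mean and spread.
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_toy, y_toy, cv=5)
print(scores.round(3), scores.mean().round(3), scores.std().round(3))
```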
